

#### CSCI-GA.3033-004

#### Graphics Processing Units (GPUs): Architecture and Programming

#### **Lecture 3: Introduction to CUDA**

Some slides here are adopted from:

NVIDIA teaching kit

Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

http://www.mzahran.com



#### **GPU Computing Applications**

#### Libraries and Middleware cuFFT VSIPL PhysX CULA cuDNN MATLAB cuBLAS Thrust SVM OptiX cuRAND MAGMA NPP Mathematica TensorRT OpenCurrent iRay cuSPARSE Programming Languages Java Directives С C++ DirectCompute Fortran Python (e.g. OpenACC) Wrappers



#### CUDA-enabled NVIDIA GPUs

| Pascal Architecture<br>(compute capabilities 6.x)  | GeForce 1000 Series                      | Quadro P Series       | Tesla P Series                |
|----------------------------------------------------|------------------------------------------|-----------------------|-------------------------------|
| Maxwell Architecture<br>(compute capabilities 5.x) | GeForce 900 Series                       | Quadro M Series       | Tesla M Series                |
| Kepler Architecture<br>(compute capabilities 3.x)  | GeForce 700 Series<br>GeForce 600 Series | Quadro K Series       | Tesla K Series                |
| Fermi Architecture<br>(compute capabilities 2.x)   | GeForce 500 Series<br>GeForce 400 Series | Quadro Fermi Series   | Tesla 20 Series               |
|                                                    | Entertainment                            | Professional Graphics | High Performance<br>Computing |

### 3 Ways to Accelerate Applications

#### **Applications**

Libraries

Compiler Directives

Programming Languages

Easy to use
Most Performance

Easy to use Portable code

Most Performance Most Flexibility

## Parallel Computing on a GPU

- GPUs deliver up to 8,800+ GFLOPS
  - Available in laptops, desktops, and clusters
- GPU parallelism is doubling almost every year
- Programming model scales transparently
  - Data parallelism
- Programmable in C (and other languages) with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism.

[SPMD = Single Program Multiple Data]



**GeForce Titan X** 

| GPU       | Cores | Cores/SM | SM | Compute Capab. |
|-----------|-------|----------|----|----------------|
| GTX 980   | 2048  | 128      | 16 | 5.2            |
| GTX Titan | 2688  | 192      | 14 | 3.5            |
| GTX 780   | 2304  | 192      | 12 | 3.5            |
| GTX 770   | 1536  | 192      | 8  | 3.0            |
| GTX 760   | 1152  | 192      | 6  | 3.0            |
| GTX 680   | 1536  | 192      | 8  | 3.0            |
| GTX 670   | 1344  | 192      | 7  | 3.0            |
| GTX 580   | 512   | 32       | 16 | 2.0            |

Source: Multicore and GPU Programming: An Integrated Approach by G. Barlas

Latest compute capability (so far): 7.x

#### CUDA

- Compute Unified Device Architecture
- Integrated host+device app C program
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code



### Parallel Threads

- A CUDA kernel is executed by an array of threads
  - All threads run the same code (the SP in SPMD)
  - Each thread has an ID that it uses to compute memory addresses and make control decisions



### Thread Blocks

- Divide monolithic thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations and barrier synchronization, ...
  - Threads in different blocks cannot cooperate



## Very similar to a C functionTo be executed on device

- Launched by the host

- All threads will execute that same code in the kernel.



- 1D or 2D (or 3D) organization of a block
- blockDim.x, blockDim.y, and blockDim.z
- gridDim.x, gridDim.y, and gridDim.z

1D, 2D, or 3D organization of a block
Block is assigned to an SM
blockldx.x. blockldx.y. and blockldx.z

- blockIdx.x, blockIdx.y, and blockIdx.z

Thread

• threadIdx.x, threadIdx.y, and threadIdx.z

### GPU Allows Automatic Scalability



Assuming you have enough parallelism

### Heterogeneous Computing



### IDs

 Each thread uses IDs to decide what data to work on

Block ID: 1D or 2D,or 3D

Thread ID: 1D, 2D, or 3D

 Simplifies memory addressing when processing multidimensional data

- Image processing
- Solving PDEs on volumes

<del>-</del> ...





```
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
  for (i = 0, i < n, i++)
                                  GPU friendly!
    C[i] = A[i] + B[i];
int main()
{
    // Memory allocation for A h, B h, and C h
    // I/O to read A h and B h, N elements
    vecAdd(A h, B h, C h, N);
}
```

```
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
 int size = n* sizeof(float);
 float* A_d, B_d, C_d;
                                                            Part 1
1. // Allocate device memory for A, B, and C
  // copy A and B to device memory
                                                Host Memory
                                                                Device Memory
2. // Kernel launch code – to have the device
                                                                     GPU
                                                   CPU
  // to perform the actual vector addition
                                                                     Part 2
3. // copy C from the device memory
  // Free device vectors
                                                            Part 3
```

## CUDA Memory Model

- Global memory
  - Main means of communicating R/W Data between host and device
  - Contents visible to all threads
  - Long latency access
- Device code can:
  - R/W per-thread registers
  - R/W per-grid global memory
- We will cover more later



### CPU & GPU Memory

- In CUDA, host and devices have separate memory spaces.
  - But ... Wait for lectures on advanced techniques
- If GPU and CPU are on the same chip,
   then they share memory space → fusion

Host

#### cudaMalloc()

Allocates object in the device Global Memory

- Requires two parameters

- Address of a pointer to the allocated object
- Size of of allocated object

#### cudaFree()

- Frees object from device Global Memory
  - Pointer to freed object



#### **Example:**

```
WIDTH = 64;
float * Md;
int size = WIDTH * sizeof(float);
```

cudaMalloc((void\*\*)&Md, size); cudaFree(Md);



- cudaMemcpy()
  - memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer
      - Host to Host
      - Host to Device
      - Device to Host
      - Device to Device
- Asynchronous transfer





cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

```
void vecAdd(float* A, float* B, float* C, int n)
 int size = n * sizeof(float):
  float* A d, * B d, * C d;
1. // Transfer A and B to device memory
  cudaMalloc((void **) &A_d, size);
  cudaMemcpy(A d, A, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &B_d, size);
  cudaMemcpy(B d, B, size, cudaMemcpyHostToDevice);
  // Allocate device memory for C d
   cudaMalloc((void **) &C d, size);
                                                       How to launch a kernel?
2. // Kernel invocation code – to be shown later
3. // Transfer C from device to host
   cudaMemcpy(C, C d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
   cudaFree(A d); cudaFree(B d); cudaFree (C d);
```

```
int vecAdd(float* A, float* B, float* C, int n)
{
  // A_d, B_d, C_d allocations and copies omitted
  // Run ceil(n/256) blocks of 256 threads each
  vecAddKernel<<<ceil(n/256),256>>>(A_d, B_d, C_d, n);
}
  #blocks #threads/blks
```

```
// Each thread performs one pair-wise addition
__global__
void vecAddkernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x; Unique ID
    if(i<n) C_d[i] = A_d[i] + B_d[i];
}
```

# Unique ID 1D grid of 1D blocks

blockIdx.x \*blockDim.x + threadIdx.x;

# Unique ID 1D grid of 2D blocks

blockIdx.x \* blockDim.x \* blockDim.y + threadIdx.y \* blockDim.x + threadIdx.x;



# Unique ID 1D grid of 3D blocks

```
blockIdx.x * blockDim.x * blockDim.y *
blockDim.z +
threadIdx.z * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x +
threadIdx.x;
```

# Unique ID 2D grid of 1D blocks

```
int blockId = blockIdx.y * gridDim.x +
blockIdx.x;
```

```
int threadId = blockId * blockDim.x +
threadIdx.x;
```

# Unique ID 2D grid of 2D blocks

```
int blockId = blockIdx.x + blockIdx.y *
gridDim.x;
int threadId = blockId * (blockDim.x *
```

```
blockDim.y) +

(threadIdx.y * blockDim.x) +

threadIdx.x;
```

# Unique ID 2D grid of 3D blocks

# Unique ID 3D grid of 1D blocks

```
int blockId = blockIdx.x
  + blockIdx.y * gridDim.x
  + gridDim.x * gridDim.y * blockIdx.z;
```

```
int threadId = blockId * blockDim.x +
    threadIdx.x;
```

# Unique ID 3D grid of 2D blocks

# Unique ID 3D grid of 3D blocks

int blockId = blockIdx.x

```
+ blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;

int threadId = blockId * (blockDim.x *
   blockDim.y * blockDim.z) +
   (threadIdx.z * (blockDim.x * blockDim.y))
   + (threadIdx.y * blockDim.x)
   + threadIdx.x;
```

|        |                               | Executed on the: | Only callable from the: |
|--------|-------------------------------|------------------|-------------------------|
| device | <pre>float DeviceFunc()</pre> | device           | device                  |
| global | <pre>void KernelFunc()</pre>  | device           | host                    |
| host   | float HostFunc()              | host             | host                    |

- global defines a kernel function. Must return void
- <u>device</u> and <u>host</u> can be used together
- For functions executed on the device:
  - No recursion ... for now ... (but wait till we talk about dynamic parallelism)
  - No static variable declarations inside the function
  - No indirect function calls through pointers

### Note about \_\_\_device\_\_\_

- Kernel calls another kernel → Dynamic parallelism
- Needs compute capability 3.5 and higher.
- From second generation of Kepler
- Nested parallelism continues till depth 24.
- All child launches must complete in order for the parent kernel to be seen as completed.
- · More details later ....

#### **Data Parallelism:**

We can safely perform many arithmetic operations on the data structures in a simultaneous manner.







C adopts raw-major placement approach when storing 2D matrix in linear memory address.

A Simple main function: executed at the host

```
// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
                                                                                k
  for (int i = 0; i < Width; ++i)
     for (int j = 0; j < Width; ++j) {
        double sum = 0;
        for (int k = 0; k < Width; ++k) {
          double a = M[i * Width + k];
          double b = N[k * Width + j];
          sum += a * b;
        P[i * Width + j] = sum;
```

```
void MatrixMultiplication(float* M, float* N, float* P, int Width)
  int size = Width * Width * sizeof(float):
  float* Md, Nd, Pd:
1. // Transfer M and N to device memory
  cudaMalloc((void**) &Md. size):
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMalloc((void**) &Nd. size):
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
  // Allocate P on the device
  cudaMalloc((void**) &Pd, size):
  MatrixMulKernel(Md, Nd, Pd, Width);
3. // Transfer P from device to host
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
  // Free device matrices
  cudaFree(Md); cudaFree(Nd); cudaFree (Pd);
```

```
// Matrix multiplication kernel - thread specification
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  // 2D Thread ID
  int tx = threadIdx.x;
  int ty = threadIdx.y:
  // Pvalue stores the Pd element that is computed by the thread
  float Pvalue = 0:
  for (int k = 0: k < Width: ++k)
                                                                                                      tx
     float Mdelement = Md[ty * Width + k]:
     float Ndelement = Nd[k * Width + tx]:
     Pvalue += Mdelement * Ndelement:
  // Write the matrix to device memory each thread writes one element
  Pd[ty * Width + tx] = Pvalue:
                                                                                                Pd
                                                          Md
                                                                                                                       ty
                                                                                                      tx
       The Kernel Function
```

### More On Specifying Dimensions

```
// Setup the execution configuration
dim3 dimGrid(x, y, z);
dim3 dimBlock(x, y, z);
```

```
// Launch the device computation threads! MatrixMulKernel << dim Grid, dim Block >>> (Md, Nd, Pd, Width);
```

#### Important:

- dimGrid and dimBlock are user defined
- gridDim and blockDim are built-in predefined variable accessible in kernel functions

#### Dimensions

Grid Block

• The above thread configuration is launched via:

$$\begin{array}{lll} {\tt dim3 & block} \, (\, 3 \, \, , 2\, ) \; ; \\ {\tt dim3 & grid} \, (\, 4 \, \, , 3 \, \, , 2\, ) \; ; \\ {\tt foo} <\!\!<\!\!<\!\! {\tt grid} \; , \; \; {\tt block} >\!\!>\!\! >)() \; ; \end{array}$$

• The <<<>>> part of the launch statement, is called the execution configuration.

Source: Multicore and GPU Programming: An Integrated Approach by G. Barlas

#### Be Sure To Know:

- Maximum dimensions of a block
- Maximum number of threads per block
- Maximum dimensions of a grid
- Maximum number of blocks per thread

|                                     | Compute Capability        |            |                |             |  |
|-------------------------------------|---------------------------|------------|----------------|-------------|--|
| Item                                | 1.x                       | 2.x        | 3.x            | <b>5.</b> x |  |
| Max. number of grid dimensions      | 2                         | 2 3        |                |             |  |
| Grid maximum x-dimension            | $2^{16} - 1$ $2^{31} - 1$ |            | <del>- 1</del> |             |  |
| Grid maximum y/z-dimension $2^{16}$ |                           | $2^{16}$ . | - 1            |             |  |
| Max. number of block dimensions     |                           | 3          |                |             |  |
| Block max. x/y-dimension            | 512 1024                  |            |                |             |  |
| Block max. z-dimension              | 64                        |            |                |             |  |
| Max. threads per block              | 512 1024                  |            |                |             |  |
| GPU example (GTX family chips)      | 8800                      | 480        | 780            | 980         |  |

Source: Multicore and GPU Programming: An Integrated Approach by G. Barlas

### Tools

Integrated C programs with CUDA extensions (\*.cu files)



Heterogeneous Computing Platform with CPUs, GPUs

### Conclusions

- Data parallelism is the main source of scalability for parallel programs
- Each CUDA source file can have a mixture of both host and device code.
- What we learned today about CUDA:
  - KernelA<<< nBlk, nTid >>>(args)
  - cudaMalloc()
  - cudaFree()
  - cudaMemcpy()
  - gridDim and blockDim
  - threadIdx.x and threadIdx.y
  - dim3